This article presents an analysis of the influence of context information on dialog act recognition.
We performed experiments on the widely explored Switchboard corpus, as well as on data annotated
according to the recent ISO 24617-2 standard. The latter was obtained from the Tilburg
DialogBank and through the mapping of the annotations of a subset of the Let's Go corpus.
We used a classification approach based on SVMs, which had proved successful in previous 
work and allowed us to limit the amount of context information provided. This way, we were 
able to observe the influence patterns as the amount of context information increased. Our base 
features consisted of n-grams, punctuation, and wh-words. Context information was obtained
from one to five preceding segments and provided either as n-grams or dialog act classifications, 
with the latter typically leading to better results and more stable influence patterns. In addition 
to the conclusions about the importance and influence of context information, our experiments 
on the Switchboard corpus also led to results that advanced the state-of-the-art on the dialog act 
recognition task on that corpus. Furthermore, the results obtained on data annotated according to 
the ISO 24617-2 standard define a baseline for future work and contribute to the standardization
of experiments in the area. 

1. Introduction 

As Searle (1969) stated, dialog, speech, or illocutionary acts are the minimal units of
linguistic communication, as they reveal the intention behind the uttered words. Thus,
automatic dialog act recognition is an important task in Natural Language Understanding
(NLU), as by identifying the intention of the conversational partner the interpretation
process is simplified. This is particularly important for the development of more
robust and natural dialog systems, since many communication problems that occur with
existing systems are due to misinterpretations of ambiguous utterances, which could be
disambiguated if the intention was correctly identified.

During a conversation, the communicative intention of a speaker and, thus, the
uttered words, depend on the current state of the dialog (Stone 2002). For instance,
if the conversation is just starting, the speaker will probably utter a greeting due to 
politeness conventions. Also, if the conversational partner asked a question, the speaker 
will probably intend to answer it, although, in some cases, he or she may choose to 
just simply ignore it for some reason. Nonetheless, either way, this means that dialog 
acts are influenced by the state and context of the dialog. However, the speaker's 
communicative intention is not usually related to the fact that the conversational partner 
asked a question ten turns before. This suggests that the range of influence of previous 
utterances is limited, or, at least, that the influence of a given utterance decays as the 
conversation evolves. 

In this article, we study the influence of context information, extracted from previous
segments in multiple ways, on dialog act recognition. With our experiments, we
assess the importance of such information for the task, its range of influence, and the 
best ways to represent it. 

Some dialog act recognition approaches (Section 2.2) try to predict the best dialog
act sequence for a whole dialog and, thus, rely not only on past and present information, 
but also on future information to classify each segment. Although such approaches 
have applications, they are not useful for live interactions, since the system does not 
have access to future information at the time it needs to assess the intention of its 
conversational partner. Thus, our studies do not explore future information and rely 
only on information extracted from the current and past segments. 

The remaining sections of this article are organized as follows: Section 2 provides
related work on dialog act recognition by presenting an overview of existing annotated
data, as well as describing multiple approaches and state-of-the-art results for different
corpora. Section 3 describes the datasets used in our experiments. Section 4 defines
our experimental setup by describing the used features, classification approach, and
evaluation methodology. The subsequent sections present a comparative evaluation of the
obtained results, both among the different approaches and datasets, and with the
state-of-the-art, discuss the achieved results, and present directions for future work.


2. Related Work 


Dialog act recognition is a classification task that attributes a dialog act label to each dialog
segment. In this sense, multiple classification approaches have been applied to this
task. To our knowledge, all of them were supervised approaches. This means that large 
amounts of annotated data are required to obtain solid models. Thus, corpora selection 
plays an important role in the task. In terms of evaluation, previous studies relied solely 
on accuracy as performance measure. Below, we analyze different annotated corpora 
and classification approaches that were previously applied on the task. 

2.1 Data Annotation 

Multiple corpora have been annotated in terms of dialog acts. Table 1 presents some
of those corpora and their characteristics. We can see that multiple domains, languages,
and kinds of interaction are covered, which enables portability experiments and domain
and interaction-independent conclusions. However, the tag sets used are not
standardized among corpora. While some of the corpora, such as DCIEM
Map Task (Bard et al. 1995) and AMI Meeting (Carletta et al. 2006), were annotated
using reduced tag sets under 20 tags, others, such as Dihana (Benedi et al. 2006) and
NESPOLE (Costantini, Burger, and Pianesi 2002), were annotated using tag sets with
hundreds of tags. Furthermore, while some - DCIEM Map Task, Switchboard (Jurafsky,
Shriberg, and Biasca 1997), SCHISMA (Keizer 2002), ICSI-MRDA (Shriberg et al. 2004),
and AMI Meeting - were annotated using domain-independent tag sets that can be used
for annotating any corpora, others - VERBMOBIL (Kay, Gawron, and Norvig 1994),
NESPOLE, Dihana, and LEGO (Schmitt, Ultes, and Minker 2012) - were annotated
using domain-dependent tag sets, which are limited to corpora in that domain. This
means that the tag sets were developed with different objectives and have different
hierarchies and levels of abstraction, which makes cross-corpora and generalization
experiments hard to perform. Also, while some of the corpora, such as Switchboard,
ICSI-MRDA, and AMI Meeting, have several tens of thousands of annotated segments,
others, like SCHISMA, have fewer than a thousand.


Table 1

Characteristics of some corpora annotated in terms of dialog acts. All corpora have at least one
human speaker in each dialog; thus, the Interaction column refers to the nature of the other
speaker. The last column, DD, states whether the tag set is domain-dependent or not.

Corpus          Interaction  Domain     Language  Segments  #Tags  DD
VERBMOBIL       Human        Schedules  Multiple     58961     72  YES
DCIEM Map Task  Human        Routes     English       4787     12  NO
Switchboard     Human        Open       English     223606     44  NO
SCHISMA         Wizard       Theatre    Dutch          440     64  NO
NESPOLE         Human        Tourism    Multiple     12565   1168  YES
ICSI-MRDA       Human        Meetings   English     105000     55  NO
AMI Meeting     Human        Meetings   English     102198     15  NO
Dihana          Wizard       Trains     Spanish      23008    248  YES
LEGO            Machine      Buses      English      14186     50  YES


In an attempt to standardize dialog act annotation and, thus, set the ground for more
comparable research in the area, Bunt et al. (2012) defined the ISO 24617-2 standard. The
first thing that should be noted in the standard is that annotations should be performed
on functional segments rather than on turns or utterances (Carroll and Tanenhaus 1978).
This is because a single turn or utterance may have multiple functions,
revealing different intentions. However, automatic functional segmentation is a complex
task on its own. Furthermore, according to the standard, dialog act annotation does not
consist of a single label, but rather of a complex structure containing information about
the participants, relations with other functional segments, the semantic dimension of
the dialog act, its communicative function, and optional qualifiers concerning certainty,
conditionality, partiality, and sentiment. In terms of semantic dimensions, the standard
defines nine - Task, Auto-Feedback, Allo-Feedback, Turn Management, Time Management,
Discourse Structuring, Own Communication Management, Partner Communication Management,
and Social Obligations Management. Communicative functions are equivalent to the
dialog act labels present in the multiple tag sets used to annotate the corpora presented
in Table 1. They are divided into general-purpose functions, which can occur in any
semantic dimension, and dimension-specific functions, which, as the name indicates,
are specific to a certain dimension. The set of general-purpose functions is hierarchically
distributed according to Table 2. Dimension-specific functions are all at the same level
and are distributed across dimensions according to Table 3. This means that the Task
dimension contains general-purpose functions only.


Table 2 

Distribution of general-purpose functions according to the ISO 24617-2 standard. 


Function               Count
Information-seeking        4
Information-providing      6
Commissive                 4
Directive                  5


Table 3 

Distribution of dimension-specific functions according to the ISO 24617-2 standard. 


Dimension                         Count
Auto-feedback                         2
Allo-feedback                         3
Turn Management                       6
Time Management                       2
Discourse Structuring                 2
Own Communication Management          2
Partner Communication Management      3
Social Obligations Management        10
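
To make concrete why an annotation under the standard is a structure rather than a single label, the fields listed in the text above could be held in a record such as the following Python sketch. The class name, field names, and segment identifier scheme are assumptions made for this illustration only; they are not identifiers defined by ISO 24617-2, and the sketch does not reproduce the standard's own serialization.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The nine semantic dimensions named in the text.
DIMENSIONS = (
    "Task", "Auto-Feedback", "Allo-Feedback", "Turn Management",
    "Time Management", "Discourse Structuring",
    "Own Communication Management", "Partner Communication Management",
    "Social Obligations Management",
)


@dataclass
class DialogActAnnotation:
    """Hypothetical container for the annotation of one functional segment."""
    sender: str
    addressee: str
    dimension: str
    communicative_function: str
    # Identifiers of related functional segments (hypothetical ID scheme).
    related_segments: List[str] = field(default_factory=list)
    # Optional qualifiers; None means "not annotated".
    certainty: Optional[str] = None
    conditionality: Optional[str] = None
    partiality: Optional[str] = None
    sentiment: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject dimensions that are not among the nine listed above.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


# Invented example: speaker A informs speaker B, in response to segment "fs3".
act = DialogActAnnotation("A", "B", "Task", "Inform", related_segments=["fs3"])
```

A recognizer that only predicts the communicative function in the Task dimension would fill in a single field of such a record, which is the simplification adopted for the ISO 24617-2 experiments described later in the article.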


2.2 Dialog Act Recognition 


Existing approaches for dialog act recognition can be split into two categories: the ones
that try to predict the best dialog act sequence for a given set of segments and the ones
that predict the dialog act of each segment individually. The approaches in the first category
take advantage of algorithms such as Conditional Random Fields (CRFs) (Lafferty,
McCallum, and Pereira 2001), Hidden Markov Models (HMMs) (Baum and Petrie 1966),
and other Bayesian Networks (BNs) (Friedman, Geiger, and Goldszmidt 1997). On
the other hand, approaches in the second category take advantage of algorithms such
as Neural Networks (McCulloch and Pitts 1943), Decision Trees (Breiman 1984), and
Support Vector Machines (SVMs) (Cortes and Vapnik 1995). We could organize related
work according to these two categories. However, since experiments are spread among
multiple corpora, it would be difficult to compare the different approaches. Thus, we
opted to organize related work by corpora, meaning that experiments performed on, at
least, similar data are presented together, allowing easier comparison.

Switchboard is probably the most explored corpus for the dialog act recognition
task. However, multiple variations of the original 44-label tag set have been used,
differing mainly on how abandoned, unrecognized, and interrupted segments are dealt
with. Thus, the number of tags varies between 41 and 44. The first experiments on this
corpus were performed by Stolcke et al. (2000), using word n-grams as features for an
HMM and a 42-label variant of the tag set. Using manual transcriptions, the best result,
71.0% accuracy, was obtained using trigrams. This value decreased to 64.8% when using
automatic transcriptions. The same authors also used Decision Trees based on prosodic
information, but only achieved 49.7% accuracy. Later, Rotaru (2002) used a Memory-Based
Learning approach (Aha, Kibler, and Albert 1991) to obtain 72.32% accuracy, also
using the 42-label variant of the tag set. He used the k-NN algorithm (Cover and Hart
1967) with the distance between neighbors being measured as the number of common
bigrams between utterances, according to a hash function. Sridhar, Bangalore, and
Narayanan (2009) used a maximum entropy model combining lexical, syntactic, and
prosodic features, as well as context information extracted from the three previous segments.
In terms of lexical and syntactic features, the authors used word n-grams, POS
tags, and Supertags, the enriched descriptions of lexical items proposed by Bangalore
and Joshi (1999). As for acoustic-prosodic features, they used pitch, energy, and accent
and boundary tone labels. Context information was provided by extracting the same
features from the surrounding segments, as well as in the form of the dialog act labels
for those segments. The experiments were performed using the 42-label variant of the
tag set, as well as a compressed version with 7 labels. Using the 42-label variant, the
authors achieved 70.4% accuracy without context information, and 76.0% when it was
included from preceding segments. These values decreased to 55.1% and 59.7% when
automatically recognized segments were used instead of manual transcriptions. Using
the compressed version, the results increased to 82.5% and 83.1%, and 69.9% and 73.9%,
respectively. The authors also performed experiments using information extracted from
the next three segments, achieving 71.3% and 56.1% on the 42-label variant, and 82.8%
and 70.7% on the compressed version. Webb and Ferguson (2010) were able to achieve
80.72% accuracy by applying a classification approach based on cue phrases, that is,
phrases that are highly indicative of a particular dialog act. However, they used a
41-label variant of the tag set, merging different kinds of statement into a class covering
49% of the corpus. Finally, Gamback, Olsson, and Tackstrom (2011) used SVMs, together
with an active learning approach to select the most informative subset of the training
data, to obtain 76.50%, 76.34%, and 77.85% accuracy on the 42-, 43-, and 44-label variants
of the tag set, respectively. The used features included multiple n-grams, punctuation,
and wh-words, as well as some context information in the form of n-grams from the
previous segments.
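
The idea of comparing utterances by the number of bigrams they share, mentioned above for the memory-based approach, can be illustrated with a minimal Python sketch. It is not a reproduction of Rotaru (2002): the hash function is not modeled, the value of k is fixed to 1, and the function names, example utterances, and labels are invented for this illustration.

```python
from typing import List, Sequence, Set, Tuple


def bigrams(tokens: Sequence[str]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent word pairs in a tokenized utterance."""
    return set(zip(tokens, tokens[1:]))


def common_bigrams(a: Sequence[str], b: Sequence[str]) -> int:
    """Similarity between two utterances: number of distinct shared bigrams."""
    return len(bigrams(a) & bigrams(b))


def nearest_label(query: Sequence[str],
                  examples: List[Tuple[List[str], str]]) -> str:
    """1-nearest-neighbor decision: label of the most similar stored example.

    Ties are resolved in favor of the earliest example in the list.
    """
    best = max(examples, key=lambda ex: common_bigrams(query, ex[0]))
    return best[1]


# Invented training examples and query, for illustration only.
examples = [
    ("i like it a lot".split(), "Statement-opinion"),
    ("do you like dogs".split(), "Yes-No-Question"),
]
predicted = nearest_label("do you like it".split(), examples)
```

In this toy case the query shares two bigrams with the second example and one with the first, so the second example's label is returned.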

On the DCIEM Map Task Corpus, experiments were performed by Wright (1998)
using three different approaches with similar results. The combination of HMMs and an
intonation model achieved 64% accuracy, while Decision Trees trained with the CART
algorithm (Breiman 1984) achieved 63% accuracy. Additionally, a Multi-Layer Perceptron
(MLP) (Rosenblatt 1962) with one hidden layer, with suprasegmental and prosodic
features based on duration as inputs, achieved 62% accuracy. Sridhar, Bangalore, and
Narayanan (2009) also performed experiments on this corpus, using the same approach
described for the Switchboard corpus. However, in this case, only manual transcriptions
were used. They achieved 66.6% without context information, 72.5% using information
from the previous segments, and 67.4% using information from the following segments.

The NESPOLE corpus was explored by Levin et al. (2003). The presence or absence
of grammar characteristics - 212 for English and 259 for German - was used as a set
of binary features for four different classification approaches. Memory-Based Learning,
through the application of the IB1 algorithm (Aha, Kibler, and Albert 1991), achieved
69.82% accuracy for English and 67.57% for German. Decision Trees trained with the
C4.5 algorithm (Quinlan 1993) achieved 70.41% for English and 67.90% for German. An
MLP achieved 71.52% for English and 67.61% for German. Finally, a Naive Bayes (Friedman,
Geiger, and Goldszmidt 1997) classifier achieved 51.39% for English and 46.00%
for German. By appending word bigram information to the Memory-Based Learning
approach, accuracy increased to 81.25% for English and 78.93% for German.

Several other corpora were explored for dialog act recognition using a single approach.
For instance, Samuel, Carberry, and Vijay-Shanker (1998) were able to achieve
71.22% accuracy on the VERBMOBIL corpus, annotated with the 18-label domain-independent
subset of the original tag set. For that, they used Transformation-Based
Learning (Brill 1995) with a Monte Carlo strategy (Metropolis and Ulam 1949). On the
SCHISMA corpus, Keizer, op den Akker, and Nijholt (2002) used BNs with sentence
type, subject type, and punctuation as features to achieve 44% accuracy. Using a switching
Dynamic Bayesian Network (DBN) (Ghahramani 1998) with a trigram language
model and prosodic features, Dielmann and Renals (2007) obtained 60% accuracy on
the AMI Meeting corpus. Following the same direct classification approach based on cue
phrases applied to the Switchboard corpus, Webb and Ferguson (2010) achieved 58.14%
accuracy on the ICSI-MRDA corpus. Finally, Gamback, Olsson, and Tackstrom (2011)
also used the SVM-based approach applied to the Switchboard corpus on the Dihana
corpus. They performed experiments using the 248-label tag set, as well as a
domain-independent subset with 72 tags, obtaining 90.97% and 94.08% accuracy, respectively.

Overall, we can see that experiments on dialog act recognition have been widely 
spread both in terms of approaches and corpora. This makes it difficult to compare 
results, even for experiments on the same corpora, since different tag sets and evaluation 
procedures have been used. Still, on the most explored corpus, Switchboard, the SVM
approach used by Gamback, Olsson, and Tackstrom (2011) seems to surpass the other 
approaches. 

In terms of features, lexical features, especially n-grams, are the most used. However,
acoustic-prosodic features have also been used, generally in experiments that did
not involve textual information. Other features, such as sentence and subject type, are
hard to obtain automatically and are themselves indicative of the dialog act. Thus, their 
identification can be seen as an intermediate step towards dialog act recognition. 

Finally, since we want to assess the influence of context information on dialog act
recognition in the context of a dialog system, it is important to notice that approaches
that predict the best dialog act sequence for a whole dialog, or even ones that take
everything that happened since the beginning of the dialog into account when classifying
a given segment, are not suitable. This is true for two reasons. First, some of those
approaches rely on future information, that is, they use information not available to
a dialog system at the time of classification, to classify a given segment. Second,
when using such approaches, it is hard to limit the amount of provided context
information, making it difficult to control the analysis we want to perform. Thus, a
non-sequential classification approach, to which context information can be provided in the
form of different features that carry that sequential information, is more suitable.
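
A minimal Python sketch of this idea is given below: the current segment is described by n-grams, punctuation, and wh-words, and a bounded window of preceding dialog act labels is appended as additional features. It is an illustration under stated assumptions, not the feature extraction used in the experiments; the wh-word list, the feature naming scheme, the maximum n-gram order, and the function names are all hypothetical.

```python
from typing import Dict, Sequence

# Assumed word lists for this sketch only.
WH_WORDS = {"what", "who", "whom", "whose", "which", "when", "where", "why", "how"}
PUNCTUATION = {"?", "!", "."}


def base_features(tokens: Sequence[str], max_n: int = 2) -> Dict[str, int]:
    """Binary n-gram, punctuation, and wh-word features for one segment."""
    feats: Dict[str, int] = {}
    lowered = [t.lower() for t in tokens]
    for n in range(1, max_n + 1):
        for i in range(len(lowered) - n + 1):
            feats["ngram=" + " ".join(lowered[i:i + n])] = 1
    for tok in lowered:
        if tok in PUNCTUATION:
            feats["punct=" + tok] = 1
        if tok in WH_WORDS:
            feats["wh=" + tok] = 1
    return feats


def with_context(tokens: Sequence[str],
                 previous_labels: Sequence[str],
                 window: int) -> Dict[str, int]:
    """Add the dialog acts of at most `window` preceding segments.

    Offset 1 is the immediately preceding segment; no future segment is used.
    """
    feats = base_features(tokens)
    recent = list(previous_labels)[-window:] if window > 0 else []
    for offset, label in enumerate(reversed(recent), start=1):
        feats[f"prev{offset}_act={label}"] = 1
    return feats


# Invented example: three earlier labels are known, but only two are used.
feats = with_context(["Where", "is", "it", "?"],
                     ["Greeting", "Inform", "Set Question"], window=2)
```

Because the window size is an explicit parameter, the amount of context can be varied from zero upwards while everything else is held fixed, which is the kind of control a sequence model makes harder to obtain. A dictionary of this form could then be vectorized and passed to any non-sequential classifier.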


3. Corpora 

ISO 24617-2 is the current and only existent standard for dialog act annotation. However,
since it is a recent standard, the amount of data annotated according to it is
small, which leaves room for questions regarding the solidity of the results achieved
by experiments performed on it. Thus, we also performed experiments on the large and
widely explored Switchboard corpus.

3.1 Switchboard


Switchboard (Godfrey, Holliman, and McDaniel 1992) is a corpus consisting of about
2400 telephone conversations among 543 American English speakers (302 male and 241
female). Each pair of speakers was automatically attributed a topic for discussion, from
70 different ones. Furthermore, speaker pairing and topic attribution were constrained
so that no two speakers would be paired with each other more than once and no one
spoke more than once on a given topic. Speech from the two subjects was recorded into
separate channels, using an 8 kHz sampling rate.

A subset of 1155 manually transcribed conversations (annotated with disfluency,
abandonment, and interruption information), containing 223606 segments, was annotated
using the SWBD-DAMSL tag set (Jurafsky, Shriberg, and Biasca 1997). Dialog
act annotation was performed by eight Linguistics graduate students at the University
of Colorado Boulder (CU-Boulder) during a three-month period. The SWBD-DAMSL
tag set was structured so that the annotators were able to label the conversations from
transcriptions alone. It contains over 200 unique tag combinations. However, in order
to obtain higher inter-annotator agreement and higher example frequencies per class,
a less fine-grained set of 44 tags was devised. The class distribution (Table 4) is highly
unbalanced, with the three most frequent classes - Statement-non-opinion (36%),
Acknowledgement (19%), and Statement-opinion (13%) - covering 68% of the corpus. The set
can be reduced to 43 or 42 categories (Stolcke et al. 2000; Rotaru 2002; Gamback, Olsson,
and Tackstrom 2011), if the Abandoned and Uninterpretable categories are merged, and
depending on how the Segment category (used when the current segment is the continuation
of the previous one by the same speaker) is treated. By analyzing the data, we
came to the conclusion that merging segments labeled as Segment with the previous
segment by the same speaker is the best approach, because some of the attributed
labels only made sense when the segments were merged. Also, it makes sense to merge
the Abandoned and Uninterpretable categories, because both represent disruptions in
the dialog flow, which interfere with the typical dialog act sequence. However, in our
experiments, we used the three variants of the tag set to allow direct comparison with
the related work. There is also a 41-category variant of the tag set (Webb and Ferguson
2010), which merges the Statement-opinion and Statement-non-opinion categories, making
the most frequent class cover 49% of the corpus.
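
The two preprocessing choices discussed above (attaching Segment continuations to the previous segment by the same speaker, and collapsing Abandoned and Uninterpretable into one class) can be sketched in Python as follows. This is a minimal sketch under assumptions: the record type, the name of the merged class, and the example dialog are invented, and the text does not specify how text is joined, so a single space is used here.

```python
from typing import List, NamedTuple


class DialogSegment(NamedTuple):
    speaker: str
    text: str
    label: str


CONTINUATION_LABEL = "Segment"  # category name as used in the text
DISRUPTION_LABELS = {"Abandoned", "Uninterpretable"}


def merge_continuations(segments: List[DialogSegment]) -> List[DialogSegment]:
    """Append each continuation to the latest earlier segment by the same speaker."""
    merged: List[DialogSegment] = []
    for seg in segments:
        if seg.label != CONTINUATION_LABEL:
            merged.append(seg)
            continue
        for i in range(len(merged) - 1, -1, -1):
            if merged[i].speaker == seg.speaker:
                merged[i] = merged[i]._replace(text=merged[i].text + " " + seg.text)
                break
        else:
            # No earlier segment by this speaker: keep the segment unchanged.
            merged.append(seg)
    return merged


def merge_disruptions(label: str) -> str:
    """Collapse the two disruption categories into one (hypothetically named) class."""
    return "Abandoned-or-Uninterpretable" if label in DISRUPTION_LABELS else label


# Invented three-segment dialog, for illustration only.
dialog = [
    DialogSegment("A", "so what do you", "Wh-Question"),
    DialogSegment("B", "uh-huh", "Acknowledgement"),
    DialogSegment("A", "think about it", "Segment"),
]
merged_dialog = merge_continuations(dialog)
```

In the invented example, the label of speaker A's first segment only makes sense once the continuation is attached, which mirrors the motivation given in the text for merging.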

This subset is called the Switchboard Dialog Act Corpus but is referred to simply
as Switchboard in this article. Figure 1 shows an excerpt of one of the transcriptions,
where each line corresponds to an annotated segment. Stolcke et al. (2000) describe a
data partition of this subset into a training set of 1115 conversations, a test set of 19
conversations, and a future use set of 21 conversations. However, the concrete partition
is not disclosed and, thus, in the remaining related bibliography there is no reference to
this partition and cross-validation is used for evaluation.
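
When no fixed partition is available, cross-validation folds can be built directly from the conversation identifiers. The Python sketch below shows one simple way to do so; splitting at the conversation level, the round-robin assignment, and the number of folds are assumptions of this sketch and are not taken from the cited studies.

```python
from typing import List, Sequence, Tuple


def conversation_folds(conversation_ids: Sequence[str],
                       k: int) -> List[Tuple[List[str], List[str]]]:
    """Split conversations into k (train, test) pairs.

    Whole conversations are held out together, so segments of one
    conversation never appear in both the training and the test side.
    """
    ids = sorted(set(conversation_ids))
    folds = []
    for i in range(k):
        test = ids[i::k]  # every k-th conversation, starting at position i
        held_out = set(test)
        train = [c for c in ids if c not in held_out]
        folds.append((train, test))
    return folds


# Ten invented conversation identifiers split into five folds.
folds = conversation_folds([f"conv{i:02d}" for i in range(10)], k=5)
```

Accuracy would then be averaged over the k test sides, each evaluated with a model trained on the corresponding training side.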

We selected this corpus for our experiments because it contains a large amount
of annotated data, which can lead to solid results. Furthermore, it has been widely
explored, which allows result comparison with previous work. Finally, its tag set is
domain-independent, which reduces the probability of drawing conclusions that depend
on the domain of the corpus.


Table 4

Label distribution in the Switchboard Dialog Act Corpus (replicated from Jurafsky, Shriberg,
and Biasca 1997).

Label                   Count   %    Label                  Count    %
Statement-non-opinion   72824  36    Collab Completion        699   .4
Acknowledgement         37096  19    Repeat-Phrase            660   .3
Statement-opinion       25197  13    Open-Question            632   .3
Agreement               10820   5    Rhetorical-Question      557   .3
Abandoned               10569   5    Hold                     540   .2
Appreciation             4663   2    Reject                   338   .2
Yes-No-Question          4624   2    Neg Non-no Answer        292   .1
Non-verbal               3548   2    Non-understanding        288   .1
Yes Answer               2934   1    Other Answer             279   .1
Conventional Closing     2486   1    Conventional Opening     220   .1
Uninterpretable          2158   1    Or-Clause                207   .1
Wh-Question              1911   1    Dispreferred Answers     205   .1
No Answer                1340   1    3rd-party-talk           115   .1
Response Acknowledge     1277   1    Offers / Options         109   .1
Hedge                    1182   1    Self-talk                102   .1
Decl Yes-No-Question     1174   1    Downplayer               100   .1
Other                    1074   1    Maybe                     98  <.1
Backchannel-Question     1019   1    Tag-Question              93  <.1
Quotation                 934  .5    Decl-Wh-Question          80  <.1
Summarize                 919  .5    Apology                   76  <.1
Aff Non-yes Answer        836  .4    Thanking                  67  <.1
Action Directive          719  .4


Speaker A: Okay. /
Speaker A: {D So, }
Speaker B: [ [ I guess, +
Speaker A: What kind of experience [ do you, + do you ] have, then with child care?
Speaker B: I think, ] + {F uh, } I wonder ] if that worked. /
Speaker A: Does it say something? /
Speaker B: I think it usually does. /
Speaker B: You might try, {F uh, } /
Speaker B: I don't know, /
Speaker B: hold it down a little longer, /
Speaker B: {C and } see if it, {F uh, } -/
Speaker A: Okay <beep>. /

Figure 1

An excerpt of a Switchboard corpus transcription. Brackets are used to annotate different
phenomena. Square brackets signal repetitions and corrections. Curly brackets signal
disfluencies.


3.2 ISO 24617-2 Data


As stated in Section 2.1, the ISO 24617-2 standard defines guidelines for dialog act
annotation, including communicative functions in multiple dimensions, dependencies
between dialog acts, and modifiers concerning, for instance, conditionality and partiality.
Not all of these aspects are relevant for our studies. In fact, although it could be
interesting to analyze the influence of context information for all dimensions, only the
Task dimension has enough diversity to be analyzed using the same procedure as on the
Switchboard corpus. Thus, in the studies presented in this article, we only explore the
influence of context information in the recognition of communicative functions in the
Task dimension.

We decided to perform experiments on data annotated according to this standard
in an attempt to contribute to the standardization of research on dialog acts. However,
since the standard is relatively new, not much data has been annotated according to
its tag set. In this sense, we were only able to obtain the data provided by the Tilburg
DialogBank (Bunt et al. 2016).¹ Thus, to obtain more data, we decided to look into other
annotated datasets whose annotations could be mapped into the ones of the standard.
There are attempts at converting other annotation formats into the standard, for instance,
the SWBD-DAMSL tag set used to annotate the Switchboard corpus (Fang et al. 2012).
However, these approaches involve manual steps which are highly time consuming.
Thus, we looked into a different corpus, LEGO (Schmitt, Ultes, and Minker 2012), an
annotated subset of the Let's Go corpus (Raux et al. 2006), which has been used in
many dialog related tasks and whose domain-dependent dialog act annotations could
be mapped into the communicative functions defined by the standard almost directly.
Although this is not a complete annotation according to the standard, it provided a large
amount of data for our studies in comparison to what we were able to obtain from the
DialogBank. More detailed information about the datasets is provided below.


3.2.1 Tilburg DialogBank. The Tilburg University DialogBank (Bunt et al. 2016) provides
multiple dialogs annotated according to the ISO 24617-2 standard. The dialogs
are extracted from different corpora in multiple languages. At the time the studies
presented in this article were performed, 11 English dialogs and 7 Dutch dialogs were
available in the DialogBank, distributed as shown in Table 5. It is important to notice
that the amount of available data is small, especially in Dutch. In terms of labels,
information-providing functions are dominant overall, with the inform tag being present
in around 13% of the segments.


Table 5

Information about the dialogs obtained from the Tilburg DialogBank.

Corpus          Language  #Dialogs  #Segments  Dominant Tag
Switchboard     English          2        554  inform (36%)
TRAINS          English          3        236  inform (19%)
HCRC Map Task   English          6       2095  instruct (14%)
DIAMOND         Dutch            3         88  inform (14%)
Dutch Map Task  Dutch            1         93  inform (19%)
OVIS            Dutch            3         91  answer (12%)
All             English         11       2885  inform (13%)
All             Dutch            7        272  inform (14%)


1 https://dialogbank.uvt.nl/ 

3.2.2 LEGO. The LEGO corpus (Schmitt, Ultes, and Minker 2012) is an annotated subset
of the Let's Go corpus (Raux et al. 2006). It features data from the Carnegie Mellon
University (CMU)'s Let's Go Bus Information System, which provides information
about bus schedules in the city of Pittsburgh, through spoken telephonic interaction with
a dialog system.

A set of 347 calls recorded during 2006 was later annotated by the Dialogue Systems 
Group at Ulm University, Germany. This annotated corpus is the one usually referred to 
as LEGO. It features 14186 utterances - 9083 system utterances and 5103 user utterances. 
The number of utterances is different because some system turns contain multiple 
utterances and some of its prompts are not answered by the user. There is no standard 
division of this corpus into training and testing sets. 

Since system utterances are generated through slot filling of fixed templates, they
have no errors and contain casing and punctuation information. In contrast, the transcriptions
of user utterances were obtained using an Automatic Speech Recognition
(ASR) system and, thus, contain neither casing nor punctuation information. Furthermore,
the recognition was not always correct, due to the Word Error Rate (WER) of the ASR
system, for which a concrete value is not revealed. Figure 2 presents an excerpt of a
transcription.


System: Welcome to the CMU Let's Go bus information system.
System: To get help at any time, just say Help or press zero.
System: What can I do for you?
User: PENN AT BUTLER TIME IS IT IS THE CENTRE AT HIGHLAND FROM
      RUNS THE 61A EIGHT THIRTY 61B CAN'T OAKLAND
System: Leaving from OAKLAND . Is this correct?
User: OAKLAND
System: Leaving from OAKLAND . Is this correct?
User: YES
System: Right.
System: Where do you want to go?
User: SQUIRREL HILL OF FIFTH AVENUE
System: Going to FIFTH AVENUE . Is this correct?
User: YES


Figure 2 

An excerpt of a LEGO corpus transcription. 


In terms of dialog acts, the LEGO corpus was originally annotated using two dis¬ 
tinct and domain-dependent tag sets for system an user turns. The set for system turns 
contained 28 tags, such as Ask Destination, Ask Confirm Bus, and Ask Time, while the set 
for user turns contained 22 tags, such as Place Information, Confirm Destination, and Reject 
Bus. When using such tags, context information is clearly very important for dialog act 
recognition, since a given dialog act drastically reduces the number of non-disruptive 
possibilities, that is, that do not break the dialog flow, for the next one. Exploring dialog 
act recognition under these conditions is not relevant for our study. However, most of 
these domain-dependent tags can be directly mapped into ISO 24617-2 communicative 
functions. Thus, in order to obtain more data armotated according to the standard, we 
performed that mapping as described in ([Ribeiro, Ribeiro, and de Matos 2016 >. We did 




not take some of the dimensions into account, as the transcriptions did not contain
enough information to allow annotations relative to those dimensions. However, since
our study focuses on the Task dimension, information about those dimensions is not
relevant. This way, we obtained over 4 times the number of annotated segments we
were able to obtain from the Tilburg DialogBank. The label distribution across the
corpus is presented in Table 6, both for the whole corpus and considering system and
user turns separately. In this sense, the nature of the corpus is highly noticeable in the
difference between system and user turns, with the system using mainly questions and
instructions and the user answering those questions.


Table 6

Label distribution in the LEGO corpus.

                     All         System          User
Label              Count    %     Count    %     Count    %
Check Question      2257   16      2256   25         1  <.1
Set Question        2197   16      1987   22       210    4
Instruct            1918   14      1812   20       106    2
Answer              1462   10         0    0      1462   29
Inform              1256    9       656    7       600   12
Confirm             1162    8         0    0      1162   23
Disconfirm          1105    8         0    0      1105   22
Promise              277    2       277    3         0    0
Request              155    1        70    1        85    2
Suggest               40   .3        40   .4         0    0


4. Experimental Setup 

We approached dialog act recognition as a supervised classification task, following the
typical steps for this kind of task. This section describes our options in terms of feature
selection, classification approaches, and evaluation methodologies.


4.1 Features 


Dialog acts are related to language and, consequently, to the words present in each
utterance, as well as to the intonation of those utterances. This means that both textual
and audio features are important to recognize dialog acts, as was shown in some of
the studies presented in Section 2 (Wright 1998; Stolcke et al. 2000; Dielmann and
Renals 2007; Sridhar, Bangalore, and Narayanan 2009). However, for the experiments
presented in this document, we relied just on lexical features extracted from
conversation transcripts. We opted for this approach since lexical features have been widely
used and proved efficient in the related work. This means that the effort of obtaining
alignments between the audio and transcriptions of all segments is unnecessary for
studying context influence patterns. Nonetheless, we believe that audio features would
be able to improve the overall classification results and, thus, they should be explored
as future work.

The transcripts were subjected to a normalization step, consisting of converting all
words to lowercase, appending tokens signaling start and end of each segment, and
separating punctuation from the words.
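
The normalization step can be illustrated with a minimal sketch. The boundary token
strings and the tokenization pattern below are assumptions made for illustration only;
the article does not specify them.

```python
import re

# Hypothetical boundary tokens; the article does not name the ones it used.
START, END = "<s>", "</s>"


def normalize(segment: str) -> list[str]:
    """Lowercase a segment, split punctuation from words, add boundary tokens."""
    lowered = segment.lower()
    # A token is either a word (optionally with internal apostrophes or
    # hyphens, as in "let's") or a single punctuation character.
    tokens = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", lowered)
    return [START, *tokens, END]
```

For example, normalize("What can I do for you?") yields the tokens of the question in
lowercase, with the question mark as a separate token, between the two boundary tokens.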

4.1.1 Base Features. We used the frequencies of word n-grams as the main features
extracted from the current segment, as n-grams are able to capture keywords and
word sequence information. In order to select which n-grams to use, we performed
experiments using a specific n-gram length, between 1 and 5, as well as using a
cumulative n, also between 1 and 5. Since SVMs were used to obtain the best results on the
Switchboard corpus (Gamback, Olsson, and Tackstrom 2011), we used an SVM classifier
with only n-gram frequencies as features for these experiments. Table 7 presents the
results obtained on the 42-label Switchboard. We can see that using both unigrams and
bigrams led to the best results. However, along each row, the result differences are not
statistically significant. Nonetheless, for n larger than 2, there is statistical significance in
the difference between the results obtained using a specific n and a cumulative n. This
means that the information provided by unigrams and bigrams is relevant. Thus, the
remaining experiments presented in this article used unigrams and bigrams as features.


Table 7 

Accuracy (%) results obtained on the 42-label Switchboard corpus using an SVM classifier with 
n-gram frequencies as features. The first row presents results obtained using a specific n-gram 
length, while the second presents results obtained using a cumulative n, that is, using all 
n-grams with n between 1 and the pivot of the column. 

n 

1 2 3 4 5 

Specific n 7169 TSlOS 72h7 69h4 65.60 

Cumulative n 72.69 73.69 73.36 73.26 73.20 
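
The difference between a specific and a cumulative n can be sketched as follows. This is
an illustrative sketch, not the code used in the experiments; the function name and the
representation of n-grams as space-joined strings are assumptions.

```python
from collections import Counter


def ngram_counts(tokens, max_n=2, cumulative=True):
    """Count word n-grams in a tokenized segment.

    With cumulative=True, all lengths from 1 to max_n are counted;
    otherwise only n-grams of length exactly max_n are counted.
    """
    lengths = range(1, max_n + 1) if cumulative else [max_n]
    counts = Counter()
    for n in lengths:
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts
```

Under these assumptions, the configuration adopted for the remaining experiments
corresponds to ngram_counts(tokens, max_n=2, cumulative=True), that is, unigrams and
bigrams together.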

In addition to n-grams, we also used the existence of wh-words and punctuation 
as features. The first provides important cues for question detection and the last may 
disambiguate different intentions behind the same words. For instance, exclamation 
marks may turn a statement into a command, while the placement of commas may 
change the whole meaning and intention of a sentence. 
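
A minimal sketch of such existence features is shown below. The word list, the
punctuation set, and the choice of one binary feature per item are assumptions; the
article only states that the existence of wh-words and punctuation was used.

```python
# Assumed inventories, for illustration only.
WH_WORDS = frozenset(
    {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}
)
PUNCTUATION = frozenset({".", ",", "?", "!"})


def cue_features(tokens):
    """Return binary features for the wh-words and punctuation marks present."""
    present = set(tokens)
    features = {f"wh:{word}": 1 for word in sorted(WH_WORDS & present)}
    features.update({f"punct:{mark}": 1 for mark in sorted(PUNCTUATION & present)})
    return features
```

These features record presence only, so repeating a wh-word or a punctuation mark
within a segment does not change the result.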

4.1.2 Context Features. Since the focus of this article is the influence of context
information on dialog act recognition, we used two different approaches to capture such
information. The first one uses the n-grams extracted from the preceding segments as
features of the segment being classified, while the second uses the dialog act
classification of the preceding segments instead. While the first approach focuses on the sequence
of words and sentences, the second focuses on the sequence of intentions. Furthermore,
the first approach can be separated into two variants: one that uses the n-grams from
the preceding segments directly, and another that tags those n-grams with
an index corresponding to the distance, in number of segments, between the segment
they were extracted from and the current segment. The first variant considers all n-grams
equally and, thus, focuses only on word sequences, while the second distinguishes
the n-grams according to their origin and, thus, also considers sentence sequences and
relative distances.
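
The three forms of context information can be contrasted with a short sketch. The
feature naming scheme (a distance prefix followed by "|") and the function signature are
hypothetical and serve only to make the distinction concrete.

```python
from collections import Counter


def add_context(current, preceding, mode):
    """Combine the features of the current segment with context features.

    current:   Counter of n-grams of the segment being classified.
    preceding: list of (n-gram Counter, dialog act label) pairs for the
               preceding segments, nearest segment first.
    mode:      "untagged", "index-tagged", or "labels".
    """
    features = Counter(current)
    for distance, (ngrams, label) in enumerate(preceding, start=1):
        if mode == "untagged":
            # Context n-grams are merged with those of the current segment.
            features.update(ngrams)
        elif mode == "index-tagged":
            # Context n-grams are kept apart by prefixing the distance.
            features.update({f"{distance}|{g}": c for g, c in ngrams.items()})
        elif mode == "labels":
            # Only the dialog act label of each preceding segment is used.
            features[f"{distance}|label={label}"] = 1
        else:
            raise ValueError(f"unknown mode: {mode}")
    return features
```

Limiting the context to the k nearest segments amounts to passing preceding[:k]. Note
that in the untagged mode an n-gram shared by the current and a preceding segment
simply has its count increased, so the classifier cannot tell where it came from.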

In order to assess the range of influence, we performed experiments using context
information extracted from the n preceding segments, with n between 1 and 5.

4.2 Classification


As stated in Section 2, multiple dialog act recognition approaches, such as the one
applied by Stolcke et al. (2000) on the Switchboard corpus, try to predict the best dialog act
sequence for a given conversation. However, our work focuses on dialog act recognition
during a conversation between a dialog system and its conversational partner. In this
scenario, such approaches are not useful since, although the system may have expectations,
it has no way to be certain of how the conversation will evolve. Thus, it must
rely only on previous and current information. Furthermore, since we want to assess
the influence of context information on dialog act recognition, we must be able to limit
the amount of information provided. Thus, instead of a sequential approach, such as
HMMs or CRFs, we opted for an approach based on SVMs, which have already been
used in the state-of-the-art approach for this task on the Switchboard corpus (Gamback,
Olsson, and Tackstrom 2011).

Due to the large number of features and the size of the Switchboard corpus, we
opted for using the linear kernel, with 0.1 as the value of the cost parameter. For the
experiments on the Switchboard corpus, we used LIBLINEAR (Fan et al. 2008) to train
the classifiers, since it is well-suited to deal with large amounts of data. For the
experiments on corpora annotated according to the ISO 24617-2 standard, we took advantage
of the Sequential Minimal Optimization (SMO) algorithm (Platt 1998) implementation
provided by the Weka Toolkit (Hall et al. 2009).


4.3 Evaluation 


We use accuracy, that is, the ratio between the number of correct predictions and the 
total number of predictions, as the performance measure, since it has been consistently 
chosen as the measure to evaluate performance in dialog act recognition. 

Since there are no fully disclosed training and testing partitions of the corpora, 
a strict comparative study is not possible. Thus, we opted for using 10-fold cross- 
validation as the evaluation procedure. However, in order to perform comparisons with 
some of the related work, other numbers of folds were also used. Nonetheless, unless 
otherwise stated, the presented results were obtained using 10-fold cross-validation. 
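
The evaluation procedure can be sketched as follows. The interleaved assignment of
items to folds is an assumption made for brevity; the article does not describe how the
folds were built (for instance, whether whole dialogs were kept together).

```python
def accuracy(gold, predicted):
    """Ratio between the number of correct predictions and the total number."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Item i is assigned to fold i % k (an assumed partitioning scheme).
    """
    folds = [list(range(fold, n_items, k)) for fold in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]
```

Each item appears in exactly one test fold, and the reported result is then the accuracy
aggregated over the k held-out folds.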

In order to assess the statistical significance of the differences between the multiple 
results obtained on the same data, we defined a significance level of 5% and performed 
the Wilcoxon Signed-Rank Test (Wilcoxon 1945). Thus, in this article, when we say that
some difference is significant/insignificant, it means that the p-value of the test was 
below/above 5%. 
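
For reference, a self-contained sketch of the test on paired samples is given below. It
computes the exact two-sided p-value by enumerating the null distribution, which is
practical only for small samples. Which paired values were compared in the experiments
(for instance, per-fold accuracies) is not stated in the article, so the inputs here are
an assumption. In practice, a library routine such as scipy.stats.wilcoxon would
typically be used instead.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples x and y.

    W is the smaller of the positive and negative rank sums and p is the
    exact two-sided p-value. Zero differences are dropped and tied absolute
    differences receive the average of their ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Work with doubled ranks so that average ranks stay integers.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    w2 = min(w_plus2, sum(ranks2) - w_plus2)
    # Null distribution: each rank is positive or negative with probability 1/2.
    counts = {0: 1}
    for r in ranks2:
        updated = {}
        for total, ways in counts.items():
            updated[total] = updated.get(total, 0) + ways
            updated[total + r] = updated.get(total + r, 0) + ways
        counts = updated
    tail = sum(ways for total, ways in counts.items() if total <= w2)
    return w2 / 2, min(1.0, 2 * tail / 2 ** n)
```

With a significance level of 5%, a difference is then called significant when the
returned p-value is below 0.05.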

5. Results 


This section presents the results we obtained on the Switchboard corpus and data
annotated according to the ISO 24617-2 standard, using different approaches to provide
context information. As in Section 3, we present the results grouped by corpus to
facilitate the comparison between the multiple approaches. However, some cross-corpora
remarks are also made along this section and discussed in Section 6.

5.1 Switchboard 


As stated in Section 3.1, we performed experiments using three variants of the
SWBD-DAMSL tag set, with 42 to 44 tags. Tables 8, 9, and 10 show the results obtained using
the 42, 43, and 44-label tag sets, respectively. The first thing that should be noticed is
that the baseline, that is, the accuracy result obtained without context information, is
above 70% for every tag set variant: 73.69%, 70.59%, and 70.57%, respectively.


Table 8 

Accuracy (%) results obtained on the 42-label Switchboard corpus using context information 
extracted from the n preceding segments in different forms. The first two rows refer to context 
information provided in the form of n-grams while the last refers to context information 
provided in the form of dialog act classifications. 


# Previous Segments         0      1      2      3      4      5
Untagged N-Grams        73.69  57.72  49.99  45.88  42.83  40.54
Index-Tagged N-Grams    73.69  74.92  74.18  73.65  73.28  73.27
Dialog Act Labels       73.69  78.20  78.88  79.06  79.03  79.03


Table 9 

Accuracy (%) results obtained on the 43-label Switchboard corpus using context information 
extracted from the n preceding segments in different forms. The first two rows refer to context 
information provided in the form of n-grams while the last refers to context information 
provided in the form of dialog act classifications. 


# Previous Segments         0      1      2      3      4      5
Untagged N-Grams        70.59  57.45  51.10  44.66  41.18  38.52
Index-Tagged N-Grams    70.59  72.98  75.16  74.78  74.49  74.44
Dialog Act Labels       70.59  75.55  76.21  76.38  76.38  76.36


Table 10 

Accuracy (%) results obtained on the 44-label Switchboard corpus using context information 
extracted from the n preceding segments in different forms. The first two rows refer to context 
information provided in the form of n-grams while the last refers to context information 
provided in the form of dialog act classifications. 

# Previous Segments         0      1      2      3      4      5
Untagged N-Grams        70.57  57.52  51.09     UM  40.96  38.24
Index-Tagged N-Grams    70.57  72.97  75.12  74.76  74.49  74.41
Dialog Act Labels       70.57  75.56  76.29  76.42  76.40  76.36

By looking at the rows corresponding to the context information provided in the
form of untagged n-grams, we can see that it is detrimental, considerably decreasing
accuracy for every tag set. Furthermore, the accuracy result significantly decreases as
the number of preceding segments increases. This phenomenon can be explained by the
fact that Switchboard dialogs do not have a fixed domain and, thus, except for social
obligations, the occurrence of similar segment sequences is relatively rare throughout
the corpus. Furthermore, since the dialogs have long segments, the addition of n-grams
from preceding segments without distinguishing them from the ones of the current
segment ends up giving more weight to the previous segments than to the current
one in terms of n-gram frequencies. This makes consecutive segments have similar
frequencies in spite of having different classifications. Moreover, similarity increases
as n increases, since only the most distant segment is replaced by the new segment. This
ends up reducing entropy among the different segments and, consequently, impairing
the classification process and explaining the results.

On the other hand, context information provided in the form of index-tagged
n-grams was able to significantly improve on the baseline, by between 1.23 and 4.57
percentage points. This happened because index tags provide additional sequence information
and effectively distinguish n-grams extracted from different segments, making each
segment more distinct from the others and, thus, increasing entropy in an important
way for a classification task with a large number of classes. However, the influence
pattern seems to be different for the multiple variants of the tag set. We can see that
information extracted from the first preceding segment is able to significantly improve
accuracy for every variant, between 1.23 and 2.40 percentage points. However, for the
42-label variant, information extracted from additional segments significantly reduces
accuracy, even to values below the baseline after the second preceding segment. On
the other hand, information extracted from the second preceding segment is still able
to significantly improve accuracy by an additional 2.18 and 2.15 percentage points for
the 43 and 44-label variants, respectively. Beyond that, accuracy starts decreasing, but
never gets below the baseline, nor even below the accuracy obtained using context
information extracted from a single preceding segment.

The last row of the tables concerns context information provided in the form of
dialog act classifications, that is, the labels attributed to the preceding segments. We can
see that by appending the classification of a single preceding segment, accuracy
significantly increased for every tag set variant, between 4.51 and 4.99 percentage points.
Approximately an additional percentage point can be added to this value by appending
information from additional segments, until the results start to stabilize. However, the
improvements provided by preceding segments beyond the second are not statistically
significant for any tag set variant. Furthermore, for the 42-label variant, even the improvement
provided by the second preceding segment is not statistically significant. This reinforces
the importance of the first preceding segment and suggests that the influence of context
information highly decreases with the distance between segments.
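
The percentage-point differences quoted above follow directly from Tables 8, 9, and 10.
The sketch below recomputes them from the index-tagged n-gram and dialog act label rows;
the data structure and function name are illustrative only.

```python
# Accuracy (%) for 0 to 5 preceding segments, copied from Tables 8, 9, and 10.
RESULTS = {
    42: {"index-tagged": [73.69, 74.92, 74.18, 73.65, 73.28, 73.27],
         "labels": [73.69, 78.20, 78.88, 79.06, 79.03, 79.03]},
    43: {"index-tagged": [70.59, 72.98, 75.16, 74.78, 74.49, 74.44],
         "labels": [70.59, 75.55, 76.21, 76.38, 76.38, 76.36]},
    44: {"index-tagged": [70.57, 72.97, 75.12, 74.76, 74.49, 74.41],
         "labels": [70.57, 75.56, 76.29, 76.42, 76.40, 76.36]},
}


def gains(row):
    """Percentage-point differences to the baseline (0 preceding segments)."""
    baseline = row[0]
    return [round(value - baseline, 2) for value in row[1:]]


for variant, rows in RESULTS.items():
    for form, row in rows.items():
        print(variant, form, gains(row))
```

For instance, the first-segment gains with dialog act labels are 4.51, 4.96, and 4.99
percentage points for the 42, 43, and 44-label variants, which matches the range stated
in the text.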

Figure 3 shows that the approach that used dialog act labels as context information
surpassed the ones that used n-grams for every tag set variant. However, the labels used
were the manual annotations of the corpus and, thus, the obtained results are an upper
bound for the approach. In order to assess the performance of this approach without
relying on gold standard annotations, we performed experiments using automatic
classifications. To obtain the automatic classifications, we split the corpus in half, trained
classifiers without context information on different subsets of the corpus, and used them
to predict the labels for the second half. We used three different subsets (the second half,
the whole corpus, and the first half) to assess the impact of the dependence between
the training and evaluation sets, from complete dependence when using the second half
to train to complete independence when using the first half. The accuracy of the labels
is presented in Table 11. As expected, the accuracy when using an independent set to
train is much lower than when using a dependent set.

In order to assess the performance when using the automatically obtained labels
as context information, we trained classifiers on the half of the corpus for which they
were predicted. We also trained classifiers on that data using the manual annotations to
assess the decrease in accuracy. The 10-fold cross-validation results obtained by these
classifiers are presented in Tables 12, 13, and 14 for the 42, 43, and 44-label tag set
variants, respectively. It is interesting to notice

Figure 3

Accuracy (%) results obtained on the Switchboard corpus using context information extracted 
from the n preceding segments in different forms. 


Table 11 

Accuracy (%) results on the second half of the Switchboard corpus obtained by classifiers 
without context information trained using different subsets of the corpus. 

# Labels 

42 43 44 

Second Half 86^88 85A5 8K2(r 

Whole Corpus 85.30 83.70 83.76 

First Half 71.53 69.59 69.73 


that the decrease in accuracy of the classifiers without context information in relation
to the ones trained on the whole corpus was of just 0.49 percentage points for the
42-label tag set variant and 1.08 and 1.04 for the 43 and 44-label variants, respectively.
In terms of the classifiers using context information, the first thing that should be
noticed, and that can also be seen in Figure 4, is that, as expected, the accuracy of the
classifiers decreased as the accuracy of the labels used to provide context information
decreased. Furthermore, it is important to notice that the influence patterns observed
when using manual annotations remained the same when using automatically obtained
labels. In this sense, the first preceding segment was always the most informative, with
the following ones providing smaller and smaller amounts of additional information, which
typically led to accuracy increments without statistical significance. Regarding
statistical significance, there is no significant difference between the results obtained when
using the labels predicted by the classifier trained on the whole corpus and the ones
predicted by the classifier trained on the second half. Furthermore, for the 43 and 44-label tag
set variants, there is also no statistically significant difference between the results obtained
using those labels and the ones obtained using manual annotations. However, the decrease in
accuracy when using the labels predicted by the classifier trained on the first half of the
corpus is always significant. In this sense, the decrease in relation to when using manual
annotations was of 2.81 percentage points for the 42-label variant, 2.68 for the 43-label
variant, and 2.65 for the 44-label variant.



Figure 4 

Accuracy (%) results obtained on the second half of the Switchboard corpus using context 
information extracted from the n preceding segments in the form of dialog act labels. 


Table 12 

Accuracy (%) results obtained on the second half of the 42-label Switchboard corpus using 
context information extracted from the n preceding segments in the form of dialog act labels. 


# Previous Segments         0      1      2      3      4      5
Manual Annotations      73.20  77.37  78.08  78.13  78.15  78.12
Second Half             73.20  76.10  76.88  77.05  77.14  77.05
Whole Corpus            73.20  75.99  76.83  76.94  77.00  76.99
First Half              73.20  74.80  75.21  75.31  75.34  75.26


Table 13 

Accuracy (%) results obtained on the second half of the 43-label Switchboard corpus using 
context information extracted from the n preceding segments in the form of dialog act labels. 


# Previous Segments         0      1      2      3      4      5
Manual Annotations      69.51  74.39  74.96  75.08  75.07  75.06
Second Half             69.51  73.01  73.89  73.88  73.96  73.83
Whole Corpus            69.51  72.87  73.72  73.81  73.84  73.76
First Half              69.51  71.65  72.40  72.40  72.29  72.23


In order to compare the results obtained using context information in the form of
automatic dialog act labels with the ones obtained using n-grams, we also trained
classifiers on the second half of the corpus using index-tagged n-grams. Table 15 presents the

Table 14

Accuracy (%) results obtained on the second half of the 44-label Switchboard corpus using
context information extracted from the n preceding segments in the form of dialog act labels.

# Previous Segments         0      1      2      3      4      5
Manual Annotations      69.53  74.42  75.00  75.11  75.12  75.09
Second Half             69.53  72.96  73.89  73.90  73.97  73.83
Whole Corpus            69.53  72.92  73.84  73.87  73.91  73.76
First Half              69.53  71.69  72.46  72.47  72.30  72.28

obtained results. We can see that the results follow the same patterns as on the whole 
corpus, with only one preceding segment providing relevant information for the 42- 
label tag set variant, while for the other variants the second preceding segment is still 
able to improve accuracy. In Figure 5 we can see that for the 42-label tag set variant
the results obtained using index-tagged n-grams were always below the ones obtained 
using automatic dialog act labels, even when they were obtained using a classifier 
trained on the first half of the corpus. On the other hand, for the other variants, the 
results are around 1.50 percentage points above the ones obtained using automatic 
labels predicted by a classifier trained on the first half of the corpus. Furthermore, 
although the results are still below the ones obtained using automatic labels predicted 
by a classifier trained on the second half of the corpus, that difference is not statistically 
significant. 



Figure 5 

Accuracy (%) results obtained on the second half of the Switchboard corpus using context
information extracted from the n preceding segments in the form of automatic dialog act labels
and index-tagged n-grams.


Overall, on the Switchboard corpus, we can see that the results obtained using the
43 and 44-label tag set variants are very similar. However, they differ from the ones
obtained using the 42-label variant, even in terms of the observable influence patterns
as the amount of context information increased. This means that the way in which the

Table 15

Accuracy (%) results obtained on the second half of the Switchboard corpus using context
information extracted from the n preceding segments in the form of index-tagged n-grams.

# Previous Segments         0      1      2      3      4      5
42 Labels               73.20  73.99  73.33  72.73  72.54  72.42
43 Labels               69.51  71.66  73.95  73.82  73.63  73.43
44 Labels               69.53  71.64  73.92  73.78  73.64  73.42

Segment category is treated is relevant for the task. In this sense, as stated in Section 3.1,
we believe that using the 42-label tag set variant, merging segments labeled as Segment
with the previous segment by the same speaker, is the best approach. Nonetheless, there
are still conclusions which can be drawn independently of the tag set variant. Context
information provided in the form of dialog act labels is potentially more informative
than in the form of n-grams, especially untagged ones, which have a negative impact
on accuracy. We used the term potentially as it depends on the accuracy of those labels.
However, for the 42-label tag set variant, that was true for all the experiments using
automatically generated labels. Furthermore, the first preceding segment was clearly
the most informative both when using index-tagged n-grams and dialog act labels
as context features. Beyond that, results obtained using index-tagged n-grams were
irregular for the different tag set variants. While for the 42-label tag set variant accuracy
started decreasing after the first preceding segment, for the remaining variants the
second previous segment was still informative. On the other hand, the results obtained
using dialog act labels, both manual and automatic, followed a pattern that suggests
a high decrease of influence in relation to the distance to the segment being classified,
with smaller and smaller accuracy increments as the distance increased.

In order to compare our results with previous results on the Switchboard corpus,
it is important to notice that our normalization step did not take the characteristics of
the transcriptions into account. However, the transcriptions of the Switchboard corpus
include disfluency, abandonment, and interruption annotations, which were processed
in the same manner as the remaining words when n-grams were extracted. By altering
the normalization step to take these annotations into account, that is, not splitting them
and considering them a single token, we were able to improve the best results to 79.60%,
78.00%, and 77.90%, which is particularly significant for the 43 and 44-label tag set variants.
These results surpassed the ones obtained by Gamback, Olsson, and Tackstrom (2011)
for every tag set variant. On the 42-label variant, the accuracy improvement exceeded
3 percentage points. However, the results obtained using context information in the
form of index-tagged n-grams were lower than the ones reported in their article using
similar information. This suggests that their active learning approach is, in fact, able
to improve results. Since the concrete corpus partition used by Stolcke et al. (2000) is
not disclosed in their paper, we performed 50-fold cross-validation to obtain results
using the same number of training and testing examples as described and, thus, try
to obtain more comparable results. Using this setup, we were able to obtain 79.60%
accuracy, which represents an accuracy improvement exceeding 8 percentage points.
In order to compare our results with the ones obtained by Webb and Ferguson (2010),
we also performed an experiment using the 41-label variant of the tag set, by merging
the statement categories. Under these conditions, we obtained 86.50% accuracy, which
represents an improvement of almost 6 percentage points. Taking this comparison
into account, we believe that our results are the current state-of-the-art for dialog act
recognition on the Switchboard corpus.


5.2 ISO 24617-2 Data 

The experiments performed on data annotated according to the ISO 24617-2 standard
using context information in the form of n-grams, both untagged and index-tagged,
were identical to the ones performed on the Switchboard corpus. However, the
experiments using context information in the form of dialog act labels differed in two
aspects. First, we only used manual annotations. Second, we performed experiments
using information about all dimensions, as well as using information related to the
Task dimension only. We did not include experiments using automatic annotations for
different reasons according to the source of the data. On the LEGO corpus, the system
is always aware of the dialog act it produced. Thus, it would be unrealistic to
automatically predict those as well. On the other hand, by predicting only the user dialog acts,
the experiment would be inherently different from the ones performed on the remaining
corpora. On data obtained from the Tilburg DialogBank, the main reason is related to
its small amount, especially for Dutch, because experiments that split the data even
further would drastically reduce accuracy. Furthermore, we would have to
automatically produce labels for the remaining dimensions as well. This is problematic since
the distribution and nature of the remaining dimensions are completely different from
those of the Task dimension. Thus, different classifiers or even rule-based approaches would be
more appropriate to predict dialog acts in those dimensions. Still, we used the manual
annotations to perform experiments using context information from all dimensions in
an attempt to assess dependencies between the Task dimension and the others.

Concerning data obtained from the Tilburg DialogBank, we performed experiments
on both English and Dutch dialogs in order to assess possible language-independent
results. Table 16 presents the results obtained on English dialogs. We can see that, in this
case, context information in the form of untagged n-grams led to much more irregular
results than in the case of the Switchboard corpus. Accuracy insignificantly decreases
0.80 percentage points below the baseline when using a single preceding segment but
beyond that it starts increasing, from 0.45 percentage points beyond the baseline when
using 2 preceding segments up to a significant 3.43 when using 5 preceding segments.
However, as shown in Figure 6, this is still the approach with worst performance. For
the remaining approaches, the first preceding segment is still the one that leads to the
largest accuracy boost. In this sense, the pattern produced by index-tagged n-grams
is a mix between the ones produced on the Switchboard corpus. The first preceding
segment significantly improves accuracy by 3.71 percentage points and, beyond that,
accuracy starts decreasing but never below the baseline. However, these differences are
all statistically insignificant. Nonetheless, the 0.28 percentage point difference between
the best results obtained using untagged n-grams and tagged n-grams is statistically
significant. Finally, the results obtained using dialog act annotations reveal the same
influence patterns as on the Switchboard corpus, with a noticeable decrease of influence
in relation to the distance to the segment being classified. Furthermore, this approach
was also the one that performed best, with a 17.54 percentage point improvement
over the baseline, versus the 3.71 percentage points of index-tagged n-grams.
Information provided by dimensions other than Task was able to improve accuracy, but only
by 0.25 percentage points. However, if we consider a single preceding segment, the
improvement is of 2.18 percentage points, which is more pronounced. Nonetheless, in
both cases, the result difference is statistically insignificant.



[Plot: accuracy (%) against # Preceding Segments (1 to 5) for Untagged N-Grams,
Index-Tagged N-Grams, Dialog Act Labels (Task), and Dialog Act Labels (All).]


Figure 6 

Accuracy (%) results obtained on the Tilburg DialogBank English dialogs using context 
information extracted from the n preceding segments in different forms. 


Table 16 

Accuracy (%) results obtained on the Tilburg DialogBank English dialogs using context 
information extracted from the n preceding segments in different forms. The first two rows refer 
to context information provided in the form of n-grams while the remaining two refer to context 
information provided in the form of dialog act classifications relative to the Task dimension only 
or all the dimensions. 


# Previous Segments             0      1      2      3      4      5
Untagged N-Grams            65.79  64.99  66.24  67.31  67.97  69.22
Index-Tagged N-Grams        65.79  69.50  68.49  69.25  69.18  69.01
Dialog Act Labels (Task)    65.79  77.68  81.94  82.29  82.88  83.08
Dialog Act Labels (All)     65.79  79.86  82.60  82.63  82.95  83.33


The results obtained on Dutch dialogs are presented in Table 17. It shows that, as
expected, accuracy results are lower than the ones obtained on English dialogs, since
the amount of data is smaller. For the same reason, the influence patterns seem more
irregular than in the previous cases, as can be seen in Figure 7. However, the importance
of context information is still noticeable. Contrary to what happened with the English
dialogs, using context information in the form of untagged n-grams followed a
detrimental pattern as on the Switchboard corpus. On the other hand, index-tagged n-grams
improved accuracy up to the third preceding segment, obtaining the best result on
this dataset with 3.68 percentage points over the baseline. However, the first preceding
segment was still the most informative, leading to an accuracy improvement of 2.94
percentage points. Beyond the third preceding segment, accuracy started significantly
decreasing, even to values below the baseline. Context information provided in the form
of dialog act labels was less effective on this dataset, with a maximum improvement of
2.94 percentage points over the baseline. Furthermore, the results did not follow the
same pattern as on the previously described experiments. In fact, there is no statistical
significance in any of the differences for this approach. This suggests that providing
context information in the form of dialog act labels requires larger amounts of training
data to be effective in comparison to information provided in the form of index-tagged
n-grams. As for information provided by other dimensions, as on the English dialogs, it
was also able to slightly improve accuracy, in this case by a maximum of 1.83 percentage
points. Given the irregular patterns obtained for Dutch, it is hard to draw
language-independent conclusions. However, the importance of context information is noticeable
in both cases, as well as the importance of representing context information in a form
distinguishable from the information related to the segment being classified, that is, in
a way that increases entropy. This is shown in both languages by the improvements
provided by index-tagged n-grams and dialog act labels and the detrimental impact
of untagged n-grams. Still, we believe that experiments using larger amounts of data
in languages other than English could lead to interesting conclusions regarding the
language-independence of the influence of context on dialog act recognition.



[Figure 7: line plot of accuracy (%) against the number of preceding segments for four
forms of context information: Untagged N-Grams, Index-Tagged N-Grams, Dialog Act Labels
(Task), and Dialog Act Labels (All).]

Figure 7

Accuracy (%) results obtained on the Tilburg DialogBank Dutch dialogs using context
information extracted from the n preceding segments in different forms.


The dialogs of the LEGO corpus have a different nature from all the previous
ones, since they consist of human-machine interactions. Furthermore, system segments
are generated using templates and slot filling. Thus, there is little variation among
system segments annotated with the same dialog act label. This has a large impact on
accuracy, as can be seen in Tables 18 and 19, which present the results on the whole
corpus and on the user segments only, respectively. The baseline accuracy difference is
14.19 percentage points. Furthermore, in a real situation, the system is aware of all the
dialog acts it produced. Thus, it is more interesting to analyze user segments only, that
is, the results in Table 19.

The first thing to notice, which can also be seen in Figure 8, is that the results
are much more similar between approaches than on the remaining corpora, with the
differences in top results being below 0.50 percentage points and statistically insignificant.
Table 17

Accuracy (%) results obtained on the Tilburg DialogBank Dutch dialogs using context
information extracted from the n preceding segments in different forms. The first two rows refer
to context information provided in the form of n-grams while the remaining two refer to context
information provided in the form of dialog act classifications relative to the Task dimension only
or all the dimensions.


# Previous Segments            0       1       2       3       4       5

Untagged N-Grams           62.13   52.94   49.26   49.26   50.00   50.00
Index-Tagged N-Grams       62.13   65.07   65.44   65.81   63.60   60.66
Dialog Act Labels (Task)   62.13   61.40   62.87   62.13   63.24   63.24
Dialog Act Labels (All)    62.13   62.13   63.24   63.24   64.71   65.07


Table 18

Accuracy (%) results obtained on the LEGO corpus using context information extracted from the
n previous segments in different forms. The first two rows refer to context information provided
in the form of n-grams while the remaining two refer to context information provided in the
form of dialog act classifications relative to the Task dimension only or all the dimensions.


# Previous Segments             0       1       2       3       4       5

Untagged N-Grams            91.44   95.36   92.11   88.66   82.36   83.58
Index-Tagged N-Grams        91.44   95.92   95.86   95.64   95.40   95.26
Manual Annotations (Task)   91.44   94.81   95.50   95.72   95.83   95.78
Manual Annotations (All)    91.44   95.65   95.85   95.85   95.94   95.85


This can be explained by the characteristics of the system segments, which improve the
accuracy of the approaches that provide context information in the form of n-grams.
This becomes even clearer when looking at the patterns produced by appending additional
preceding segments. The best result, with an improvement above 10 percentage
points over the baseline, is obtained by appending a single segment, which, since we
are looking at results for user segments, is a fixed system segment. In these cases, the
n-grams extracted from that segment appear multiple times in the corpus, with only
a few possible following dialog act labels. Thus, they provide a very important cue
for the classifier and accuracy is highly improved. Previous segments beyond the first
start to reduce accuracy, as user segments are now taken into account. However, we
can still notice that the decrease is much more pronounced for untagged n-grams than
for index-tagged n-grams, as the first approach obtains results below the baseline after
appending the fourth preceding segment, while the latter still obtains results more than
10 percentage points above the baseline. As for the approaches based on context
information provided in the form of dialog act labels, it is interesting to notice that
the influence pattern revealed on the Switchboard corpus and the Tilburg DialogBank
dialogs is also present on the LEGO corpus, with information from the first preceding
segment leading to a large increase in accuracy and information from the following ones
leading to smaller and smaller increments. However, in this case, the improvement
provided by information extracted from the first previous segment is not as high as
for the n-gram-based approaches. This shows that the n-grams from the fixed system
segments are able to provide more fine-grained information than the simple dialog act
label. However, this can be explained by the nature of the corpus itself, as the kind
of information or instruction provided by the system highly limits the possible user
dialog acts. Finally, it is important to notice that, once again, information from other
dimensions significantly improved accuracy, by 2.21 percentage points, when using a
single preceding segment, but only by an insignificant 0.27 percentage points on the top
results.



[Figure 8: line plot of accuracy (%) against the number of preceding segments for four
forms of context information: Untagged N-Grams, Index-Tagged N-Grams, Dialog Act Labels
(Task), and Dialog Act Labels (All).]

Figure 8

Accuracy (%) results obtained on the user segments of the LEGO corpus using context
information extracted from the n preceding segments in different forms.


Table 19 

Accuracy (%) results obtained on the user segments of the LEGO corpus using context 
information extracted from the n preceding segments in different forms. The first two rows refer 
to context information provided in the form of n-grams while the remaining two refer to context 
information provided in the form of dialog act classifications relative to the Task dimension only 
or all the dimensions. 


# Previous Segments             0       1       2       3       4       5

Untagged N-Grams            77.25   88.67   84.53   79.58   69.90   73.37
Index-Tagged N-Grams        77.25   88.87   88.69   87.91   87.69   87.12
Manual Annotations (Task)   77.25   85.99   87.81   88.10   88.48   88.43
Manual Annotations (All)    77.25   88.20   88.65   88.75   88.67   88.73


Overall, the experiments on data annotated according to the ISO 24617-2 standard
led to results and patterns similar to the ones obtained on the Switchboard corpus.
This is important, since it means that our conclusions are not specific to one dialog
act annotation tag set. Furthermore, except for some aspects, the conclusions are also
corpus-independent. The main differences occurred in the experiments on the LEGO
corpus, on which there were almost no performance differences between approaches in
terms of maximum accuracy. However, this was explained by the nature of the system
segments that were used to provide context information. In terms of multilinguality, not
many conclusions could be drawn, since the results on the Dutch dialogs obtained from
the Tilburg DialogBank led to irregular patterns, probably due to the reduced amount
of data. However, the importance of context information and some of the patterns were
still noticeable.

6. Discussion 

In this document, we presented an analysis of the influence of context information on
dialog act recognition. The analysis was performed on the widely explored Switchboard
corpus, as well as on data annotated according to the recent ISO 24617-2 standard. While
the first was chosen for its large amount of data and for the sake of comparison with
previous research in the area, the latter was chosen in an attempt to contribute to the
standardization of experiments in the area. In this sense, in addition to data obtained
from the Tilburg DialogBank, we also used the LEGO corpus by mapping the original
annotations of the corpus into the communicative functions of the standard.

Context information was obtained from one up to five preceding segments and
provided in three ways. The first, untagged n-grams, that is, using n-grams from
previous segments in a way indistinguishable from those of the current segment, was
generally detrimental. The only exception was on the LEGO corpus, where untagged
n-grams from the first preceding segment were able to improve the baseline accuracy.
However, this was due to the rigid nature of the system segments in the corpus and
the approach was still outperformed by the others on the same dataset. The second
approach to provide context information was in the form of index-tagged n-grams,
that is, n-grams tagged with the distance between the segment they were extracted
from and the current segment. In this case, accuracy highly improved using a single
previous segment. However, beyond that, there were no visible improvements. Finally,
information provided in the form of dialog act classifications was able to gradually
improve accuracy and revealed similar influence patterns on every corpus. In this sense,
the influence of preceding segments seemed to decrease exponentially with the distance.
Furthermore, it is important to notice that the same patterns were verified even when
using automatic annotations instead of the manual annotations of the gold standard.
Also, in the case of data annotated according to the ISO 24617-2 standard, including
information from dimensions other than Task led to slight accuracy improvements, up
to a maximum of about 2 percentage points. However, in general, these improvements were
not significant.
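
The difference between the first two ways of providing context can be made concrete with a
small sketch in Python. It is only an illustration under stated assumptions: the maximum
n-gram length, the binary (set-based) feature representation, and the "prev<distance>:"
tag format are hypothetical choices, not the configuration used in the experiments.

```python
def ngrams(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def context_features(current, preceding, tagged, max_n=2):
    """Build a set of binary features for one segment.

    current   -- token list of the segment being classified
    preceding -- token lists of earlier segments, nearest segment first
    tagged    -- if True, prefix context n-grams with their distance
    """
    features = set(ngrams(current, max_n))
    for distance, tokens in enumerate(preceding, start=1):
        for gram in ngrams(tokens, max_n):
            # Hypothetical tag format: "prev<distance>:<n-gram>".
            features.add(f"prev{distance}:{gram}" if tagged else gram)
    return features


current = ["yes"]
preceding = [["do", "you", "agree"], ["yes"]]  # nearest segment first

untagged = context_features(current, preceding, tagged=False)
tagged = context_features(current, preceding, tagged=True)
```

In the untagged set, the "yes" uttered two segments earlier collapses into the "yes" of the
current segment, so the classifier cannot tell them apart. In the index-tagged set it
survives as the separate feature "prev2:yes".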

In terms of the language independence of the conclusions, it is difficult to make
any particular assessment, since we were only able to obtain a reduced amount of
non-English data and, thus, the obtained results were irregular. However, the importance of
context information is still highly noticeable and some of the influence patterns are still
observable.

Overall, our experiments showed that context information extracted from preceding
segments is able to improve classification performance on the dialog act recognition
task, independently of corpora characteristics, tag sets, and language. However, that
information should be provided in a manner distinguishable from information extracted
from the current segment, that is, the features representing context information should
be distinct from the ones representing the current segment. Otherwise, it may have a
negative effect. This distinction can be made either by tagging the features with an index
relative to the segment they were extracted from, or by using different kinds of features
for context and current segment information. The first preceding segment is the most
informative, typically leading to the best results when using n-grams and the largest
performance improvement when using dialog act classifications. In this sense, it is not
recommended to use information from additional preceding segments when using
n-grams, as the outcomes varied among the different corpora. On the other hand, the
approach based on dialog act classifications benefits from information from additional
segments, until the results start to stabilize, around the third preceding segment. Finally,
in terms of overall performance, the approach based on dialog act labels typically
achieved the best results, even when using automatic annotations.
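
The use of automatic annotations as context can be sketched as a sequential loop in which
the labels predicted for earlier segments are fed back as features for the next one. The
Python below is a hedged illustration only: toy_classifier is a stand-in for a trained
model, the "da<distance>:" feature format and the "<none>" padding symbol are assumptions,
and the labels "question", "answer", and "statement" are placeholders rather than tags
from any of the tag sets discussed here.

```python
def label_context(history, n_context, pad="<none>"):
    """Distance-tagged labels of the n_context most recent segments."""
    recent = history[::-1][:n_context]            # nearest segment first
    recent += [pad] * (n_context - len(recent))   # pad at the start of a dialog
    return [f"da{d}:{label}" for d, label in enumerate(recent, start=1)]


def classify_dialog(segments, classify, n_context=3):
    """Label segments in order, feeding predicted labels back as context."""
    predicted = []
    for tokens in segments:
        features = set(tokens) | set(label_context(predicted, n_context))
        predicted.append(classify(features))
    return predicted


def toy_classifier(features):
    """Stand-in for a trained model; the rules are purely illustrative."""
    if "?" in features:
        return "question"
    if "da1:question" in features:
        return "answer"
    return "statement"


dialog = [["is", "it", "late", "?"], ["yes"], ["the", "bus", "left"]]
labels = classify_dialog(dialog, toy_classifier)
```

The default of three context segments mirrors the point at which the results were observed
to stabilize. Because the context features are of a different kind from the word features
of the current segment, they remain distinguishable without any further tagging of the
words themselves.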

In addition to the conclusions about the importance and influence of context
information, it is important to notice that our experiments on the Switchboard corpus
also led to results that advanced the state-of-the-art on the dialog act recognition task
on that corpus, using every variant of the tag set. Furthermore, the results obtained on
data annotated according to the ISO 24617-2 standard define a baseline for future work
and contribute to the standardization of experiments in the area.

7. Future Work 


In our experiments, we only considered textual features. However, the studies by
Wright (1998), Stolcke et al. (2000), Dielmann and Renals (2007), and Sridhar, Bangalore,
and Narayanan (2009) show that audio features are also able to provide important
information for the dialog act classification task and are not influenced by ASR errors.
Thus, it is our intention to perform further experiments, exploring the ability of
acoustic-prosodic features to provide context information for dialog act recognition.

Considering ASR, it would be interesting to analyze how WER influences the
performance of context features. Since we have manual transcriptions of the Switchboard
corpus, this can be done by generating automatic transcriptions of the same dialogs and
observing the differences in performance.
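
For reference, WER is conventionally computed as the word-level edit distance between the
manual and automatic transcriptions, divided by the number of words in the manual one. The
following Python sketch implements that standard definition; whitespace tokenization and
the function name word_error_rate are simplifying assumptions, and no text normalization
is applied.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

Grouping the dialogs by the WER of their automatic transcriptions would then make it
possible to compare the accuracy of each form of context information across error levels.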

Furthermore, concerning the ISO 24617-2 standard, it would be interesting to
perform experiments to identify communicative functions on the other dimensions and
assess whether the preceding segments are able to provide important information for
those dimensions as well.

Finally, it is important to obtain more annotated data in non-English languages, so 
that more extensive studies concerning the language-independence of our conclusions 
can be performed. 